We often think of Momentum as a means of dampening oscillations
and speeding up the iterations, leading to faster convergence. But it
has other interesting behavior. It allows a larger range of step-sizes
to be used, and creates its own oscillations. What is going on?
Here’s a popular story about momentum [1, 2, 3]:
gradient descent is a man walking down a hill. He follows the steepest
path downwards; his progress is slow, but steady. Momentum is a heavy
ball rolling down the same hill. The added inertia acts both as a
smoother and an accelerator, dampening oscillations and causing us to
barrel through narrow valleys, small humps and local minima.
This standard story isn’t wrong, but it fails to explain many
important behaviors of momentum. In fact, momentum can be understood far
more precisely if we study it on the right model.
One nice model is the convex quadratic. This model is rich enough to
reproduce momentum’s local dynamics in real problems, and yet simple
enough to be understood in closed form. This balance gives us powerful
traction for understanding this algorithm.
We begin with gradient descent. The algorithm has many virtues, but
speed is not one of them. It is simple — when optimizing a smooth
function f, we make a small step in the gradient
wk+1=wk−α∇f(wk).
For a step-size small enough, gradient descent makes a monotonic
improvement at every iteration. It always converges, albeit to a local
minimum. And under a few weak curvature conditions it can even get there
at an exponential rate.
But the exponential decrease, though appealing in theory, can often be
infuriatingly small. Things often begin quite well — with an
impressive, almost immediate decrease in the loss. But as the iterations
progress, things start to slow down. You start to get a nagging feeling
you’re not making as much progress as you should be. What has gone
wrong?
The problem could be the optimizer’s old nemesis, pathological curvature. Pathological curvature is, simply put, regions of f
which aren’t scaled properly. The landscapes are often described as
valleys, trenches, canals and ravines. The iterates either jump between
valleys, or approach the optimum in small, timid steps. Progress along
certain directions grind to a halt. In these unfortunate regions,
gradient descent fumbles.
Momentum proposes the following tweak to gradient descent. We give gradient descent a short-term memory:
zk+1wk+1=βzk+∇f(wk)=wk−αzk+1
The change is innocent, and costs almost nothing. When β=0 , we recover gradient descent. But for β=0.99 (sometimes 0.999,
if things are really bad), this appears to be the boost we need. Our
iterations regain that speed and boldness it lost, speeding to the
optimum with a renewed energy.
Optimizers call this minor miracle “acceleration”.
The new algorithm may seem at first glance like a cheap hack. A simple
trick to get around gradient descent’s more aberrant behavior — a
smoother for oscillations between steep canyons. But the truth, if
anything, is the other way round. It is gradient descent which is the
hack. First, momentum gives up to a quadratic speedup on many functions.
1
This is no small matter — this is similar to the speedup you get from
the Fast Fourier Transform, Quicksort, and Grover’s Algorithm. When the
universe gives you quadratic speedups, you should start to pay
attention.
But there’s more. A lower bound, courtesy of Nesterov [5],
states that momentum is, in a certain very narrow and technical sense,
optimal. Now, this doesn’t mean it is the best algorithm for all
functions in all circumstances. But it does satisfy some curiously
beautiful mathematical properties which scratch a very human itch for
perfection and closure. But more on that later. Let’s say this for
now — momentum is an algorithm for the book.
First Steps: Gradient Descent
We begin by studying gradient descent on the simplest model possible which isn’t trivial — the convex quadratic,
f(w)=21wTAw−bTw,w∈Rn.
Assume A is symmetric and invertible, then the optimal solution w⋆ occurs at
w⋆=A−1b.
Simple as this model may be, it is rich enough to approximate many functions (think of A as your favorite model of curvature — the Hessian, Fisher Information Matrix [6],
etc) and captures all the key features of pathological curvature. And
more importantly, we can write an exact closed formula for gradient
descent on this function.
This is how it goes. Since ∇f(w)=Aw−b, the iterates are
wk+1=wk−α(Awk−b).
Here’s the trick. There is a very natural space to view gradient
descent where all the dimensions act independently — the eigenvectors of
A.
Every symmetric matrix A has an eigenvalue decomposition
A=Qdiag(λ1,…,λn)QT,Q=[q1,…,qn],
and, as per convention, we will assume that the λi’s are sorted, from smallest λ1 to biggest λn. If we perform a change of basis, xk=QT(wk−w⋆), the iterations break apart, becoming:
xik+1=xik−αλixik=(1−αλi)xik=(1−αλi)k+1xi0
Moving back to our original space w, we can see that
wk−w⋆=Qxk=i∑nxi0(1−αλi)kqi
and there we have it — gradient descent in closed form.
Decomposing the Error
The above equation admits a simple interpretation. Each element of x0 is the component of the error in the initial guess in the Q-basis. There are n
such errors, and each of these errors follows its own, solitary path to
the minimum, decreasing exponentially with a compounding rate of 1−αλi. The closer that number is to 1, the slower it converges.
For most step-sizes, the eigenvectors with largest eigenvalues
converge the fastest. This triggers an explosion of progress in the
first few iterations, before things slow down as the smaller
eigenvectors’ struggles are revealed. By writing the contributions of
each eigenspace’s error to the loss
f(wk)−f(w⋆)=∑(1−αλi)2kλi[xi0]2
we can visualize the contributions of each error component to the loss.
Optimization can be seen as combination of several component problems, shown here as 1 2 3 with eigenvalues λ1=0.01, λ2=0.1, and λ3=1 respectively.
Step-size
Optimal Step-size
Choosing A Step-size
The above analysis gives us immediate guidance as to how to set a step-size α. In order to converge, each ∣1−αλi∣ must be strictly less than 1. All workable step-sizes, therefore, fall in the interval
0<αλi<2.
The overall convergence rate is determined by the slowest error component, which must be either λ1 or λn:
rate(α)=imax∣1−αλi∣=max{∣1−αλ1∣,∣1−αλn∣}
This overall rate is minimized when the rates for λ1 and λn
are the same — this mirrors our informal observation in the previous
section that the optimal step-size causes the first and last
eigenvectors to converge at the same rate. If we work this through we
get:
optimal α=αargminrate(α)optimal rate=αminrate(α)=λ1+λn2=λn/λ1+1λn/λ1−1
Notice the ratio λn/λ1
determines the convergence rate of the problem. In fact, this ratio
appears often enough that we give it a name, and a symbol — the
condition number.
condition number:=κ:=λ1λn
The condition number means many things. It is a measure of how close to singular a matrix is. It is a measure of how robust A−1b is to perturbations in b. And, in this context, the condition number gives us a measure of how poorly gradient descent will perform. A ratio of κ=1
is ideal, giving convergence in one step (of course, the function is
trivial). Unfortunately the larger the ratio, the slower gradient
descent will be. The condition number is therefore a direct measure of
pathological curvature.
Example: Polynomial Regression
The above analysis reveals an insight: all errors are not made equal. Indeed, there are different kinds of errors, n to be exact, one for each of the eigenvectors of A. And gradient descent is better at correcting some kinds of errors than others. But what do the eigenvectors of A mean? Surprisingly, in many applications they admit a very concrete interpretation.
Lets see how this plays out in polynomial regression. Given 1D data, ξi, our problem is to fit the model
model(ξ)=w1p1(ξ)+⋯+wnpn(ξ)pi=ξ↦ξi−1
to our observations, di. This model, though nonlinear in the input ξ, is linear in the weights, and therefore we can write the model as a linear combination of monomials, like:
Because of the linearity, we can fit this model to our data ξi using linear regression on the model mismatch
minimizew21i∑(model(ξi)−di)2=21∥Zw−d∥2
where
Z=⎝⎜⎜⎛11⋮1ξ1ξ2⋮ξmξ12ξ22⋮ξm2……⋱…ξ1n−1ξ2n−1⋮ξmn−1⎠⎟⎟⎞.
The path of convergence, as we know, is elucidated when we view the iterates in the space of Q (the eigenvectors of ZTZ). So let’s recast our regression problem in the basis of Q. First, we do a change of basis, by rotating w into Qw, and counter-rotating our feature maps p into eigenspace, p¯. We can now conceptualize the same regression as one over a different polynomial basis, with the model
model(ξ)=x1p¯1(ξ)+⋯+xnp¯n(ξ)p¯i=∑qijpj.
This model is identical to the old one. But these new features p¯
(which I call “eigenfeatures”) and weights have the pleasing property
that each coordinate acts independently of the others. Now our
optimization problem breaks down, really, into n
small 1D optimization problems. And each coordinate can be optimized
greedily and independently, one at a time in any order, to produce the
final, global, optimum. The eigenfeatures are also much more
informative:
The data comes in 2 clusters. The first 2 eigenfeatures capture variations between the clusters. Next there are smooth variations within clusters, peaks within clusters,and finally, jagged polynomials which differ wildly on neighboring points. drag points to fit data
The observations in the above diagram can be justified mathematically.
From a statistical point of view, we would like a model which is, in
some sense, robust to noise. Our model cannot possibly be meaningful if
the slightest perturbation to the observations changes the entire model
dramatically. And the eigenfeatures, the principal components of the
data, give us exactly the decomposition we need to sort the features by
its sensitivity to perturbations in di’s.
The most robust components appear in the front (with the largest
eigenvalues), and the most sensitive components in the back (with the
smallest eigenvalues).
This measure of robustness, by a rather convenient coincidence, is
also a measure of how easily an eigenspace converges. And thus, the
“pathological directions” — the eigenspaces which converge the
slowest — are also those which are most sensitive to noise! So starting
at a simple initial point like 0
(by a gross abuse of language, let’s think of this as a prior), we
track the iterates till a desired level of complexity is reached. Let’s
see how this plays out in gradient descent.
Eigenvalue 160152654,53677,6521,329,083Step k = 49
When an eigenspace has converged to three significant digits, the bar greys out. Drag the observations to change fit.We begin at x=w=0
This effect is harnessed with the heuristic of early stopping : by
stopping the optimization early, you can often get better generalizing
results. Indeed, the effect of early stopping is very similar to that of
more conventional methods of regularization, such as Tikhonov
Regression. Both methods try to suppress the components of the smallest
eigenvalues directly, though they employ different methods of spectral
decay.2
But early stopping has a distinct advantage. Once the step-size is
chosen, there are no regularization parameters to fiddle with. Indeed,
in the course of a single optimization, we have the entire family of
models, from underfitted to overfitted, at our disposal. This gift, it
seems, doesn’t come at a price. A beautiful free lunch [7] indeed.
The Dynamics of Momentum
Let’s turn our attention back to momentum. Recall that the momentum update is
zk+1wk+1=βzk+∇f(wk)=wk−αzk+1.
Since ∇f(wk)=Awk−b, the update on the quadratic is
zk+1wk+1=βzk+(Awk−b)=wk−αzk+1.
Following [8], we go through the same motions, with the change of basis xk=Q(wk−w⋆) and yk=Qzk, to yield the update rule
yik+1xik+1=βyik+λixik=xik−αyik+1.
in which each component acts independently of the other components (though xik and yik are coupled). This lets us rewrite our iterates as
3(yikxik)=Rk(yi0xi0)R=(β−αβλi1−αλi).
There are many ways of taking a matrix to the kth power. But for the 2×2 case there is an elegant and little known formula [9] in terms of the eigenvalues of R, σ1 and σ2.
Rk={σ1kR1−σ2kR2σ1k(kR/σ1−(k−1)I)σ1≠σ2σ1=σ2,Rj=σ1−σ2R−σjI
This formula is rather complicated, but the takeaway here is that it
plays the exact same role the individual convergence rates, 1−αλi
do in gradient descent. But instead of one geometric series, we have
two coupled series, which may have real or complex values. The
convergence rate is therefore the slowest of the two rates, max{∣σ1∣,∣σ2∣}4.
By plotting this out, we see there are distinct regions of the
parameter space which reveal a rich taxonomy of convergence behavior [10]:
s1 = s2 = complex
02410Momentum
β=Step-size
α=
0.00.20.40.60.81.01.2
Convergence Rate
A plot of max{∣σ1∣,∣σ2∣} reveals distinct regions, each with its own style of convergence.
xik−xi∗RipplesR's eigenvalues are complex, and the iterates display low frequency ripples. Surprisingly, the convergence rate
2√β is independent of
α and
λi.
xik−xi∗Monotonic DecreaseR's eigenvalues are both real, are positive, and have norm less than one. The behavior here resembles gradient descent.
xik−xi∗1-Step ConvergenceWhen
α=1/λi, and
β=0, we converge in one step. This is a very special point, and kills the error in the eigenspace completely.
xik−xi∗Monotonic OscillationsWhen
α>1/λi, the iterates flip between
+ and
− at each iteration. These are often referred to as 'oscillations' in gradient descent.
xik−xi∗DivergenceWhen
max{∣σ1∣,∣σ2∣}>1, the iterates diverge.
For what values of α and β does momentum converge? Since we need both σ1 and σ2 to converge, our convergence criterion is now max{∣σ1∣,∣σ2∣}<1. The range of available step-sizes work out 5 to be
0<αλi<2+2βfor0≤β<1
We recover the previous result for gradient descent when β=0. But notice an immediate boon we get. Momentum allows us to crank up the step-size up by a factor of 2 before diverging.
The Critical Damping Coefficient
The true magic happens, however, when we find the sweet spot of α and β. Let us try to first optimize over β.
Momentum admits an interesting physical interpretation when α is [11]
small: it is a discretization of a damped harmonic oscillator. Consider
a physical simulation operating in discrete time (like a video game).
yik+1=+λixik
and perturbed by an external force field
We can think of −yik as velocityβyik
which is dampened at each step
xik+1=xik−αyik+1
And x is our particle’s position
which is moved at each step by a small amount in the direction of the velocity yik+1.
We can break this equation apart to see how each component affects the dynamics of the system. Here we plot, for 150 iterates, the particle’s velocity (the horizontal axis) against its position (the vertical axis), in a phase diagram.
β = 10.9850.950.90.850.8 λ = 0 0.02 0.1 0.25VelocityDampingExternal ForcePositionβ: Horizontal Axis When
λi=0 and
β=1, the object moves at constant speed. As
β goes down, the particle decelerates, losing a proportion of its energy at each tick. λ: Vertical Axis
The external force causes the particle to return to the origin.
Combining damping and the force field, the particle behaves like a
damped harmonic oscillator, returning lazily to equlibrium.
Initial point x = 1, y = 1
This system is best imagined as a weight suspended on a spring. We
pull the weight down by one unit, and we study the path it follows as it
returns to equilibrium. In the analogy, the spring is the source of our
external force λixik, and equilibrium is the state when both the position xik and the speed yik are 0. The choice of β crucially affects the rate of return to equilibrium.
reachesoptimum
Overdamping
When
β is too small (e.g. in Gradient Descent,
β=0), we're over-damping. The particle is immersed in a viscous fluid which saps it of its kinetic energy at every timestep.
Critical Damping
The best value of
β lies in the middle of the two extremes. This sweet spot happens when the eigenvalues of
R are repeated, when
β=(1−√αλi)2.
Underdamping
When
β
is too large we're under-damping. Here the resistance is too small, and
spring oscillates up and down forever, missing the optimal value over
and over.
1.00.80.60.40.20.0Momentum β
The critical value of β=(1−√αλi)2 gives us a convergence rate (in eigenspace i) of 1−√αλi. A square root improvement over gradient descent, 1−αλi! Alas, this only applies to the error in the ith eigenspace, with α fixed.
Optimal parameters
To get a global convergence rate, we must optimize over both α and β. This is a more complicated affair,6 but they work out to be
α=(√λ1+√λn2)2β=(√λn+√λ1√λn−√λ1)2
Plug this into the convergence rate, and you get
With barely a modicum of extra effort, we have essentially square
rooted the condition number! These gains, in principle, require explicit
knowledge of λ1 and λn. But the formulas reveal a simple guideline. When the problem’s conditioning is poor, the optimal α is approximately twice that of gradient descent, and the momentum term is close to 1. So set β as close to 1 as you can, and then find the highest α which still converges. Being at the knife’s edge of divergence, like in gradient descent, is a good place to be.
We can do the same decomposition here with momentum, with eigenvalues
λ1=0.01,
λ2=0.1, and
λ3=1. Though the decrease is no longer monotonic, but significantly faster. f(wk)−f(w⋆)
Note that the optimal parameters do not necessarily imply the fastest
convergence, though, only the fastest asymptotic convergence rate.
Step-size α =
Momentum β =
02401
020406080100120140Eigenvalue 123
Optimal parameters
While the loss function of gradient descent had a graceful, monotonic
curve, optimization with momentum displays clear oscillations. These
ripples are not restricted to quadratics, and occur in all kinds of
functions in practice. They are not cause for alarm, but are an
indication that extra tuning of the hyperparameters is required.
Example: The Colorization Problem
Let’s look at how momentum accelerates convergence with a concrete example. On a grid of pixels let G be the graph with vertices as pixels, E be the set of edges connecting each pixel to its four neighboring pixels, and D be a small set of a few distinguished vertices. Consider the problem of minimizing
minimize21i∈D∑(wi−1)2
The colorizer pulls distinguished pixels towards 1
+21i,j∈E∑(wi−wj)2.
The smoother spreads out the color
The optimal solution to this problem is a vector of all 1’s 7.
An inspection of the gradient iteration reveals why we take a long time
to get there. The gradient step, for each component, is some form of
weighted average of the current value and its neighbors:
wik+1=wik−αj∈N∑(wik−wjk)−{α(wik−1)0i∈Di∉D
This kind of local averaging is effective at smoothing out local
variations in the pixels, but poor at taking advantage of global
structure. The updates are akin to a drop of ink, diffusing through
water. Movement towards equilibrium is made only through local
corrections and so, left undisturbed, its march towards the solution is
slow and laborious. Fortunately, momentum speeds things up
significantly.
The eigenvectors of the colorization problem form a generalized Fourier basis for Rn.
The smallest eigenvalues have low frequencies, hence gradient descent
corrects high frequency errors well but not low frequency ones.
020406080100
Each square represents a node, colored by its weight. Edges connect each square to the four neighboring squares.Weights
02401
Step-size α =
Momentum β =
Eigenvalue 125501500200005751,07415,367219,695Step k = 20
Optimal parameters
In vectorized form, the colorization problem is
minimize
The smoother’s quadratic form is the Graph Laplacian21i∈D∑(xTeieiTx−eiTx)+21xTLGx
And the colorizer is a small low rank correction with a linear term. ei is the ith unit vector.
The Laplacian matrix, LG8,
which dominates the behavior of the optimization problem, is a valuable
bridge between linear algebra and graph theory. This is a rich field of
study, but one fact is pertinent to our discussion here. The
conditioning of LG,
here defined as the ratio of the second eigenvector to the last (the
first eigenvalue is always 0, with eigenvector equal to the matrix of
all 1′s), is directly connected to the connectivity of the graph.
012345678910111213141516171819202122232425262728293031323334350,00,10,20,30,40,51,01,11,21,31,41,52,02,12,22,32,42,53,03,13,23,33,43,54,04,14,24,34,44,55,05,15,25,35,45,501234567891011121314151617181920212223242526272829303132333435Small world graphs, like expanders and dense graphs, have excellent conditioningThe conditioning of grids improves with its dimensionality.And long, wiry graphs, like paths, condition poorly.
These observations carry through to the colorization problem, and the
intuition behind it should be clear. Well connected graphs allow rapid
diffusion of information through the edges, while graphs with poor
connectivity do not. And this principle, taken to the extreme, furnishes
a class of functions so hard to optimize they reveal the limits of
first order optimization.
The Limits of Descent
Let’s take a step back. We have, with a clever trick, improved the
convergence of gradient descent by a quadratic factor with the
introduction of a single auxiliary sequence. But is this the best we can
do? Could we improve convergence even more with two sequences? Could
one perhaps choose the α’s and β’s intelligently and adaptively? It is tempting to ride this wave of optimism - to the cube root and beyond!
Unfortunately, while improvements to the momentum algorithm do exist,
they all run into a certain, critical, almost inescapable lower bound.
Adventures in Algorithmic Space
To understand the limits of what we can do, we must first formally
define the algorithmic space in which we are searching. Here’s one
possible definition. The observation we will make is that both gradient
descent and momentum can be “unrolled”. Indeed, since
w1w2wk+1===⋮=w0−α∇f(w0)w1−α∇f(w1)w0−α∇f(w0)−α∇f(w1)w0−α∇f(w0)−⋯⋯−α∇f(wk)
we can write gradient descent as
wk+1=w0−αi∑k∇f(wi).
A similar trick can be done with momentum:
wk+1=w0+αi∑k1−β(1−βk+1−i)∇f(wi).
In fact, all manner of first order algorithms, including the Conjugate
Gradient algorithm, AdaMax, Averaged Gradient and more, can be written
(though not quite so neatly) in this unrolled form. Therefore the class
of algorithms for which
wk+1=w0+i∑kγik∇f(wi) for some γik
contains momentum, gradient descent and a whole bunch of other
algorithms you might dream up. This is what is assumed in Assumption
2.1.4 [5] of Nesterov. But let’s push this even further, and expand this class to allow different step-sizes for different directions.
wk+1=w0+i∑kΓik∇f(wi) for some diagonal matrix Γik.
This class of methods covers most of the popular algorithms for
training neural networks, including ADAM and AdaGrad. We shall refer to
this class of methods as “Linear First Order Methods”, and we will show a
single function all these methods ultimately fail on.
The Resisting Oracle
Earlier, when we talked about the colorizer problem, we observed that
wiry graphs cause bad conditioning in our optimization problem. Taking
this to its extreme, we can look at a graph consisting of a single
path — a function so badly conditioned that Nesterov called a variant of
it “the worst function in the world”. The function follows the same
structure as the colorizer problem, and we shall call this the Convex
Rosenbrock,
fn(w)=
with a colorizer of one node
21(w1−1)2+21i=1∑n(wi−wi+1)2
strong couplings of adjacent nodes in the path,
+κ−12∥w∥2.
and a small regularization term.
The optimal solution of this problem is
wi⋆=(√κ+1√κ−1)i
and the condition number of the problem fn approaches κ as n goes to infinity. Now observe the behavior of the momentum algorithm on this function, starting from w0=0.
02401
Step-size α =
Momentum β =
Here we see the first 50 iterates of momentum on the Convex Rosenbrock for n=25. The behavior here is similar to that of any Linear First Order Algorithm.
This triangle is a “dead zone” of our iterates. The iterates are always 0, no matter what the parameters. The
remaining expanding space is the “light cone” of our iterate’s
influence. Momentum does very well here with the optimal parameters.
Iteration 01020304050
Error
0.00.20.40.60.81.0
Weights
0.00.20.40.60.81.0
Optimal parameters
The observations made in the above diagram are true for any Linear
First Order algorithm. Let us prove this. First observe that each
component of the gradient depends only on the values directly before and
after it:
∇f(x)i=2wi−wi−1−wi+1+κ−14wi,i≠1.
Therefore the fact we start at 0 guarantees that that component must
remain stoically there till an element either before or after it turns
nonzero. And therefore, by induction, for any linear first order
algorithm,
Think of this restriction as a “speed of light” of information transfer. Error signals will take at least k steps to move from w0 to wk. We can therefore sum up the errors which cannot have changed yet9:
∥wk−w⋆∥∞≥i≥k+1max{∣wi⋆∣}=(√κ+1√κ−1)k+1=(√κ+1√κ−1)k∥w0−w⋆∥∞.
As n gets large, the condition number of fn approaches κ.
And the gap therefore closes; the convergence rate that momentum
promises matches the best any linear first order algorithm can do. And
we arrive at the disappointing conclusion that on this problem, we
cannot do better.
Like many such lower bounds, this result must not be taken literally,
but spiritually. It, perhaps, gives a sense of closure and finality to
our investigation. But this is not the final word on first order
optimization. This lower bound does not preclude the possibility, for
example, of reformulating the problem to change the condition number
itself! There is still much room for speedups, if you understand the
right places to look.
Momentum with Stochastic Gradients
There is a final point worth addressing. All the discussion above
assumes access to the true gradient — a luxury seldom afforded in modern
machine learning. Computing the exact gradient requires a full pass
over all the data, the cost of which can be prohibitively expensive.
Instead, randomized approximations of the gradient, like minibatch
sampling, are often used as a plug-in replacement of ∇f(w). We can write the approximation in two parts,
∇f(w)
the true gradient
+error(w).
and an approximation error. If the estimator is unbiased e.g. E[error(w)]=0
It is helpful to think of our approximate gradient as the injection of
a special kind of noise into our iteration. And using the machinery
developed in the previous sections, we can deal with this extra term
directly. On a quadratic, the error term cleaves cleanly into a separate
term, where 10
(yikxik)
the noisy iterates are a sum of
=Rk(yi0xi0)
the noiseless, deterministic iterates and
+ϵikj=1∑kRk−j(1−α)
a decaying sum of the errors, where ϵk=Q⋅error(wk).
The error term, ϵk, with its dependence on the wk, is a fairly hairy object. Following [10],
we model this as independent 0-mean Gaussian noise. In this simplified
model, the objective also breaks into two separable components, a sum of
a deterministic error and a stochastic error
11, visualized here.
We decompose the expected value of the objective value Ef(w)−f(w⋆) into a deterministic part and a stochastic part . Ef(w)−f(w⋆) The small black dots are a single run of stochastic gradient
Step-size α =
Momentum β =
As [1]
observes, the optimization has two phases. In the initial transient
phase the magnitude of the noise is smaller than the magnitude of the
gradient, and Momentum still makes good progress. In the second,
stochastic phase, the noise overwhelms the gradient, and momentum is
less effective.
Note that there are a set of unfortunate tradeoffs which seem to pit
the two components of error against each other. Lowering the step-size,
for example, decreases the stochastic error, but also slows down the
rate of convergence. And increasing momentum, contrary to popular
belief, causes the errors to compound. Despite these undesirable
properties, stochastic gradient descent with momentum has still been
shown to have competitive performance on neural networks. As [1]
has observed, the transient phase seems to matter more than the
fine-tuning phase in machine learning. And in fact, it has been recently
suggested [12]
that this noise is a good thing — it acts as a implicit regularizer,
which, like early stopping, prevents overfitting in the fine-tuning
phase of optimization.
Onwards and Downwards
The study of acceleration is seeing a small revival within the
optimization community. If the ideas in this article excite you, you may
wish to read [13],
which fully explores the idea of momentum as the discretization of a
certain differential equation. But other, less physical, interpretations
exist. There is an algebraic interpretation of momentum in terms of
approximating polynomials [3, 14]. Geometric interpretations are emerging [15, 16],
connecting momentum to older methods, like the Ellipsoid method. And
finally, there are interpretations relating momentum to duality [17], perhaps providing a clue as how to accelerate second order methods and Quasi Newton (for a first step, see [18]).
But like the proverbial blind men feeling an elephant, momentum seems
like something bigger than the sum of its parts. One day, hopefully
soon, the many perspectives will converge into a satisfying whole.
Acknowledgments
I am deeply indebted to the editorial contributions of Shan Carter and
Chris Olah, without which this article would be greatly impoverished.
Shan Carter provided complete redesigns of many of my original
interactive widgets, a visual coherence for all the figures, and
valuable optimizations to the page’s performance. Chris Olah provided
impeccable editorial feedback at all levels of detail and abstraction -
from the structure of the content, to the alignment of equations.
I am also grateful to Michael Nielsen for providing the title of this
article, which really tied the article together. Marcos Ginestra
provided editorial input for the earliest drafts of this article, and
spiritual encouragement when I needed it the most. And my gratitude
extends to my reviewers, Matt Hoffman and Anonymous Reviewer B for their
astute observations and criticism. I would like to thank Reviewer B, in
particular, for pointing out two non-trivial errors in the original
manuscript (discussion here). The contour plotting library for the hero visualization is the joint work of Ben Frederickson, Jeff Heer and Mike Bostock.
Many thanks to the numerous pull requests and issues filed on github.
Thanks in particular, to Osemwaro Pedro for spotting an off by one error
in one of the equations. And also to Dan Schmidt who did an editing
pass over the whole project, correcting numerous typographical and
grammatical errors.
It is possible, however, to
construct very specific counterexamples where momentum does not
converge, even on convex functions. See [4] for a counterexample.
In Tikhonov Regression we add a quadratic penalty to the regression, minimizing
minimize21∥Zw−d∥2+2η∥w∥2=21wT(ZTZ+ηI)w−(Zd)Tw
Recall that ZTZ=Qdiag(Λ1,…,Λn)QT. The solution to Tikhonov Regression is therefore
(ZTZ+ηI)−1(Zd)=Qdiag(λ1+η1,⋯,λn+η1)QT(Zd)
We can think of regularization as a function which decays the largest eigenvalues, as follows:
Tikhonov Regularized λi=λi+η1=λi1(1−(1+λi/η)−1).
Gradient descent can be seen as employing a similar decay, but with the decay rate
Gradient Descent Regularized λi=λi1(1−(1−αλi)k)
instead. Note that this decay is dependent on the step-size.
This is true as we can write updates in matrix form as
(1α01)(yik+1xik+1)=(β0λi1)(yikxik)
which implies, by inverting the matrix on the left,
(yik+1xik+1)=(β−αβλi1−αλi)(yikxik)=Rk+1(xi0yi0)
We can write out the convergence rates explicitly. The eigenvalues are
σ1σ2=21(1−αλ+β+√(−αλ+β+1)2−4β)=21(1−αλ+β−√(−αλ+β+1)2−4β)
When the (−αλ+β+1)2−4β<0 is less than zero,
then the roots are complex and the convergence rate is
∣σ1∣=∣σ2∣=√(1−αλ+β)2+∣(−αλ+β+1)2−4β∣=2√β
Which is, surprisingly, independent of the step-size or the eigenvalue αλ. When the roots are real, the convergence rate is
max{∣σ1∣,∣σ2∣}=21max{∣1−αλi+β±√(1−αλi+β)2−4β∣}
This can be derived by reducing the inequalities for all 4 + 1 cases in the explicit form of the convergence rate above.
We must optimize over
α,βminmax{∥∥∥∥(β−αβλi1−αλi)∥∥∥∥,…,∥∥∥∥(β−αβλn1−αλn)∥∥∥∥}.
(∥⋅∥
here denotes the magnitude of the maximum eigenvalue), and occurs when
the roots of the characteristic polynomial are repeated for the matrices
corresponding to the extremal eigenvalues.
The above optimization problem is bounded from below by 0, and vector of all 1’s achieve this.
This can be written explicitly as
[LG]ij=⎩⎪⎨⎪⎧degree of vertex i−10i=ji≠j,(i,j) or (j,i)∈Eotherwise
We use the infinity norm to measure our error, similar results can be derived for the 1 and 2 norms.
The momentum iterations are
zk+1wk+1=βzk+Awk+error(wk)=wk−αzk+1.
which, after a change of variables, become
(1α01)(yik+1xik+1)=(β0λi1)(yikxik)+(ϵik0)
Inverting the 2×2 matrix on the left, and applying the formula recursively yields the final solution.
On the 1D function f(x)=2λx2, the objective value is
Ef(xk)=2λE[(xk)2]=2λE(e2TRk(y0x0)+ϵke2Ti=1∑kRk−i(1−α))2=2λe2TRk(y0x0)+2λE(ϵke2Ti=1∑kRk−i(1−α))2=2λe2TRk(y0x0)+2λE[ϵk]⋅i=1∑k(e2TRk−i(1−α))2=2λe2TRk(y0x0)+2λE[ϵk⋅i=1∑kγi2,γi=e2TRk−i(1−α)
The third inequality uses the fact that Eϵk=0 and the fourth uses the fact they are uncorrelated.
References
On the importance of initialization and momentum in deep learning.[PDF] Sutskever, I., Martens, J., Dahl, G.E. and Hinton, G.E., 2013. ICML (3), Vol 28, pp. 1139—1147.
Some methods of speeding up the convergence of iteration methods[PDF] Polyak, B.T., 1964. USSR Computational Mathematics and Mathematical Physics, Vol 4(5), pp. 1—17. Elsevier. DOI: 10.1016/0041-5553(64)90137-5
Theory of gradient methods Rutishauser,
H., 1959. Refined iterative methods for computation of the solution and
the eigenvalues of self-adjoint boundary value problems, pp. 24—49.
Springer. DOI: 10.1007/978-3-0348-7224-9_2
Analysis and design of optimization algorithms via integral quadratic constraints[PDF] Lessard, L., Recht, B. and Packard, A., 2016. SIAM Journal on Optimization, Vol 26(1), pp. 57—95. SIAM.
Introductory lectures on convex optimization: A basic course Nesterov, Y., 2013. , Vol 87. Springer Science \& Business Media. DOI: 10.1007/978-1-4419-8853-9
Natural gradient works efficiently in learning[link] Amari, S., 1998. Neural computation, Vol 10(2), pp. 251—276. MIT Press. DOI: 10.1162/089976698300017746
Deep Learning, NIPS′2015 Tutorial[PDF] Hinton, G., Bengio, Y. and LeCun, Y., 2015.
Adaptive restart for accelerated gradient schemes[PDF] O’Donoghue, B. and Candes, E., 2015. Foundations of computational mathematics, Vol 15(3), pp. 715—732. Springer. DOI: 10.1007/s10208-013-9150-3
The Nth Power of a 2x2 Matrix.[PDF] Williams, K., 1992. Mathematics Magazine, Vol 65(5), pp. 336. MAA. DOI: 10.2307/2691246
From Averaging to Acceleration, There is Only a Step-size.[PDF] Flammarion, N. and Bach, F.R., 2015. COLT, pp. 658—695.
On the momentum term in gradient descent learning algorithms[PDF] Qian, N., 1999. Neural networks, Vol 12(1), pp. 145—151. Elsevier. DOI: 10.1016/s0893-6080(98)00116-6
Understanding deep learning requires rethinking generalization[PDF] Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O., 2016. arXiv preprint arXiv:1611.03530.
A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights[PDF] Su, W., Boyd, S. and Candes, E., 2014. Advances in Neural Information Processing Systems, pp. 2510—2518.
The Zen of Gradient Descent[HTML] Hardt, M., 2013.
A geometric alternative to Nesterov’s accelerated gradient descent[PDF] Bubeck, S., Lee, Y.T. and Singh, M., 2015. arXiv preprint arXiv:1506.08187.
An optimal first order method based on optimal quadratic averaging[PDF] Drusvyatskiy, D., Fazel, M. and Roy, S., 2016. arXiv preprint arXiv:1604.06543.
Linear coupling: An ultimate unification of gradient and mirror descent[PDF] Allen-Zhu, Z. and Orecchia, L., 2014. arXiv preprint arXiv:1407.1537.
Accelerating the cubic regularization of Newton’s method on convex problems[PDF] Nesterov, Y., 2008. Mathematical Programming, Vol 112(1), pp. 159—181. Springer. DOI: 10.1007/s10107-006-0089-x
Diagrams and text are licensed under Creative Commons Attribution CC-BY 2.0, unless noted otherwise, with the source available on GitHub.
The figures that have been reused from other sources don't fall under
this license and can be recognized by a note in their caption: “Figure
from …”.
For attribution in academic contexts, please cite this work as
@article{goh2017why,
author = {Goh, Gabriel},
title = {Why Momentum Really Works},
journal = {Distill},
year = {2017},
url = {http://distill.pub/2017/momentum},
doi = {10.23915/distill.00006}
}
On the importance of initialization and momentum in deep learning.[PDF] I. Sutskever, J. Martens, G.E. Dahl, G.E. Hinton. ICML (3), Vol 28, pp. 1139—1147. 2013.
Some methods of speeding up the convergence of iteration methods[PDF] B.T. Polyak. USSR Computational Mathematics and Mathematical Physics, Vol 4(5), pp. 1—17. Elsevier. 1964. DOI: 10.1016/0041-5553(64)90137-5
Theory of gradient methods H. Rutishauser. Refined
iterative methods for computation of the solution and the eigenvalues
of self-adjoint boundary value problems, pp. 24—49. Springer. 1959. DOI: 10.1007/978-3-0348-7224-9_2
Analysis and design of optimization algorithms via integral quadratic constraints[PDF] L. Lessard, B. Recht, A. Packard. SIAM Journal on Optimization, Vol 26(1), pp. 57—95. SIAM. 2016.
Introductory lectures on convex optimization: A basic course Y. Nesterov. , Vol 87. Springer Science \& Business Media. 2013. DOI: 10.1007/978-1-4419-8853-9
Natural gradient works efficiently in learning[link] S. Amari. Neural computation, Vol 10(2), pp. 251—276. MIT Press. 1998. DOI: 10.1162/089976698300017746
Deep Learning, NIPS′2015 Tutorial[PDF] G. Hinton, Y. Bengio, Y. LeCun. 2015.
Adaptive restart for accelerated gradient schemes[PDF] B. O’Donoghue, E. Candes. Foundations of computational mathematics, Vol 15(3), pp. 715—732. Springer. 2015. DOI: 10.1007/s10208-013-9150-3
The Nth Power of a 2x2 Matrix.[PDF] K. Williams. Mathematics Magazine, Vol 65(5), pp. 336. MAA. 1992. DOI: 10.2307/2691246
From Averaging to Acceleration, There is Only a Step-size.[PDF] N. Flammarion, F.R. Bach. COLT, pp. 658—695. 2015.
On the momentum term in gradient descent learning algorithms[PDF] N. Qian. Neural networks, Vol 12(1), pp. 145—151. Elsevier. 1999. DOI: 10.1016/s0893-6080(98)00116-6
Introductory lectures on convex optimization: A basic course Y. Nesterov. , Vol 87. Springer Science \& Business Media. 2013. DOI: 10.1007/978-1-4419-8853-9
From Averaging to Acceleration, There is Only a Step-size.[PDF] N. Flammarion, F.R. Bach. COLT, pp. 658—695. 2015.
On the importance of initialization and momentum in deep learning.[PDF] I. Sutskever, J. Martens, G.E. Dahl, G.E. Hinton. ICML (3), Vol 28, pp. 1139—1147. 2013.
On the importance of initialization and momentum in deep learning.[PDF] I. Sutskever, J. Martens, G.E. Dahl, G.E. Hinton. ICML (3), Vol 28, pp. 1139—1147. 2013.
Understanding deep learning requires rethinking generalization[PDF] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals. arXiv preprint arXiv:1611.03530. 2016.
A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights[PDF] W. Su, S. Boyd, E. Candes. Advances in Neural Information Processing Systems, pp. 2510—2518. 2014.
Theory of gradient methods H. Rutishauser. Refined
iterative methods for computation of the solution and the eigenvalues
of self-adjoint boundary value problems, pp. 24—49. Springer. 1959. DOI: 10.1007/978-3-0348-7224-9_2
A geometric alternative to Nesterov’s accelerated gradient descent[PDF] S. Bubeck, Y.T. Lee, M. Singh. arXiv preprint arXiv:1506.08187. 2015.
An optimal first order method based on optimal quadratic averaging[PDF] D. Drusvyatskiy, M. Fazel, S. Roy. arXiv preprint arXiv:1604.06543. 2016.
Linear coupling: An ultimate unification of gradient and mirror descent[PDF] Z. Allen-Zhu, L. Orecchia. arXiv preprint arXiv:1407.1537. 2014.
Accelerating the cubic regularization of Newton’s method on convex problems[PDF] Y. Nesterov. Mathematical Programming, Vol 112(1), pp. 159—181. Springer. 2008. DOI: 10.1007/s10107-006-0089-x
It is possible, however, to construct very specific counterexamples
where momentum does not converge, even on convex functions. See [4] for a counterexample.
In Tikhonov Regression we add a quadratic penalty to the regression, minimizing
minimize21∥Zw−d∥2+2η∥w∥2=21wT(ZTZ+ηI)w−(Zd)Tw
Recall that ZTZ=Qdiag(Λ1,…,Λn)QT. The solution to Tikhonov Regression is therefore
(ZTZ+ηI)−1(Zd)=Qdiag(λ1+η1,⋯,λn+η1)QT(Zd)
We can think of regularization as a function which decays the largest eigenvalues, as follows:
Tikhonov Regularized λi=λi+η1=λi1(1−(1+λi/η)−1).
Gradient descent can be seen as employing a similar decay, but with the decay rate
Gradient Descent Regularized λi=λi1(1−(1−αλi)k)
instead. Note that this decay is dependent on the step-size.
This is true as we can write updates in matrix form as
(1α01)(yik+1xik+1)=(β0λi1)(yikxik)
which implies, by inverting the matrix on the left,
(yik+1xik+1)=(β−αβλi1−αλi)(yikxik)=Rk+1(xi0yi0)
We can write out the convergence rates explicitly. The eigenvalues are
σ1σ2=21(1−αλ+β+√(−αλ+β+1)2−4β)=21(1−αλ+β−√(−αλ+β+1)2−4β)
When the (−αλ+β+1)2−4β<0 is less than zero,
then the roots are complex and the convergence rate is
∣σ1∣=∣σ2∣=√(1−αλ+β)2+∣(−αλ+β+1)2−4β∣=2√β
Which is, surprisingly, independent of the step-size or the eigenvalue αλ. When the roots are real, the convergence rate is
max{∣σ1∣,∣σ2∣}=21max{∣1−αλi+β±√(1−αλi+β)2−4β∣}
This can be derived by reducing the inequalities for all 4 + 1 cases in the explicit form of the convergence rate above.
We must optimize over
α,βminmax{∥∥∥∥(β−αβλi1−αλi)∥∥∥∥,…,∥∥∥∥(β−αβλn1−αλn)∥∥∥∥}.
(∥⋅∥
here denotes the magnitude of the maximum eigenvalue), and occurs when
the roots of the characteristic polynomial are repeated for the matrices
corresponding to the extremal eigenvalues.
The above optimization problem is bounded from below by 0, and vector of all 1’s achieve this.
This can be written explicitly as
[LG]ij=⎩⎪⎨⎪⎧degree of vertex i−10i=ji≠j,(i,j) or (j,i)∈Eotherwise
We use the infinity norm to measure our error, similar results can be derived for the 1 and 2 norms.
The momentum iterations are
zk+1wk+1=βzk+Awk+error(wk)=wk−αzk+1.
which, after a change of variables, become
(1α01)(yik+1xik+1)=(β0λi1)(yikxik)+(ϵik0)
Inverting the 2×2 matrix on the left, and applying the formula recursively yields the final solution.
On the 1D function f(x)=2λx2, the objective value is
Ef(xk)=2λE[(xk)2]=2λE(e2TRk(y0x0)+ϵke2Ti=1∑kRk−i(1−α))2=2λe2TRk(y0x0)+2λE(ϵke2Ti=1∑kRk−i(1−α))2=2λe2TRk(y0x0)+2λE[ϵk]⋅i=1∑k(e2TRk−i(1−α))2=2λe2TRk(y0x0)+2λE[ϵk⋅i=1∑kγi2,γi=e2TRk−i(1−α)
The third inequality uses the fact that Eϵk=0 and the fourth uses the fact they are uncorrelated.
Distill
is dedicated to clear explanations of machine learning